import pandas as pd
import numpy as np
from lets_plot import *

# add the additional libraries you need to import for ML here
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

LetsPlot.setup_html(isolated_frame=True)
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Include and execute your code here

# import your data here using pandas and the URL
df = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv")
df2 = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")
dfc = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_denver/dwellings_denver.csv")
Elevator pitch
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.
QUESTION|TASK 1
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
The first chart uses the first CSV file to show the number of bedrooms in homes built before 1980; many are at 2 and 2.5. To compare the two periods, I merged the first and second CSV files and plotted the statistics before and after 1980. That bar graph shows a lot more after 1980.
QUESTION|TASK 2
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.
I used the before1980 variable as the target (trainY) and a set of features as the inputs (trainX). The model reached about 86% accuracy, which falls short of the 90% goal, but overall that is a reasonable result given the information I gave it.
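A minimal sketch of this kind of workflow, run on synthetic data so it is self-contained. The feature names, the made-up rule that generates the target, and the hyperparameters are all assumptions for illustration; they are not the settings behind the 86% figure, which came from the real dwellings data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dwellings data (all values are made up).
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "feature_area": rng.normal(1800, 400, n),               # hypothetical
    "feature_stories": rng.integers(1, 4, n).astype(float),  # hypothetical
    "feature_baths": rng.integers(1, 5, n).astype(float),    # hypothetical
})
# Made-up rule so the target is learnable: smaller homes count as "before 1980".
demo["before1980"] = (demo["feature_area"] < 1800).astype(int)
# Blank out a few values to mimic missing data.
missing_rows = demo.sample(frac=0.05, random_state=1).index
demo.loc[missing_rows, "feature_baths"] = np.nan

X = demo.drop(columns="before1980")
y = demo["before1980"]
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=76
)

# Fit the imputer on the training split only, then apply it to both splits.
imputer = SimpleImputer(strategy="median")
train_x_filled = imputer.fit_transform(train_x)
test_x_filled = imputer.transform(test_x)

model = RandomForestClassifier(n_estimators=100, random_state=76)
model.fit(train_x_filled, train_y)

accuracy = accuracy_score(test_y, model.predict(test_x_filled))
print(f"accuracy: {accuracy:.3f}")
```

Fitting the imputer on the training split alone keeps information from the test rows out of the model, so the reported accuracy is a fairer estimate.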
QUESTION|TASK 3
Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
This chart shows six features that were important to the model.
importance_df = (
    pd.DataFrame({
        "Feature": x_missData.columns,
        "Importance": randomEsimate.feature_importances_,
    })
    .sort_values("Importance", ascending=False)
    .head(15)
    .reset_index(drop=True)
)

# Create Label column for plotting (just copy Feature names)
importance_df["Label"] = importance_df["Feature"]

# Make Label an ordered categorical so the plot keeps the importance ordering
importance_df["Label"] = pd.Categorical(
    importance_df["Label"],
    categories=importance_df.sort_values("Importance")["Label"],
    ordered=True,
)

# Plot using lets-plot
plot = (
    ggplot(importance_df, aes(x="Label", y="Importance"))
    + geom_bar(stat="identity", fill="#4C72B0")
    + coord_flip()
    + xlab("Feature")
    + ylab("Importance Score")
)

# Show the plot
plot.show()
QUESTION|TASK 4
Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
type your results and analysis here
# Include and execute your code here
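As a starting point for this task, the sketch below computes accuracy, precision, and recall from confusion-matrix counts in plain Python. The counts are hypothetical and are not results from this project's model; "positive" is assumed to mean a home built before 1980.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all homes that were labeled correctly.
        "accuracy": (tp + tn) / total,
        # Of the homes predicted "before 1980", the share that really were.
        "precision": tp / (tp + fp),
        # Of the homes truly built before 1980, the share the model found.
        "recall": tp / (tp + fn),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy summarizes overall correctness, while precision and recall separate the two kinds of mistakes: labeling a newer home as pre-1980, and missing a home that really is pre-1980.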
STRETCH QUESTION|TASK 1
Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.
type your results and analysis here
# Include and execute your code here
STRETCH QUESTION|TASK 2
Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.
type your results and analysis here
# Include and execute your code here
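The join described in this task might look like the sketch below. It uses two tiny made-up tables in place of the real CSV files; the `parcel` and `before1980` names come from the task text, while `feature_area`, `nbhd_a`, and `nbhd_b` are hypothetical column names.

```python
import pandas as pd

# Tiny made-up stand-ins for dwellings_ml.csv and dwellings_neighborhoods_ml.csv.
homes = pd.DataFrame({
    "parcel": ["A1", "B2", "C3"],
    "feature_area": [1200, 2400, 1800],  # hypothetical feature
    "before1980": [1, 0, 1],
})
neighborhoods = pd.DataFrame({
    "parcel": ["A1", "B2", "D4"],
    "nbhd_a": [1, 0, 0],                 # hypothetical indicator columns
    "nbhd_b": [0, 1, 1],
})

# A left join keeps every home; homes with no neighborhood row get NaN.
# validate="one_to_one" raises an error if a parcel repeats on either side.
joined = homes.merge(
    neighborhoods, on="parcel", how="left", validate="one_to_one"
)
print(joined)
```

If the real files contain repeated parcels, the `validate` check will raise; deciding how to handle those duplicates before merging would then be part of the analysis.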
STRETCH QUESTION|TASK 3
Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.
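Predicting the year itself is a regression problem rather than a classification one, so accuracy no longer applies. The sketch below computes three common regression metrics in plain Python; the actual and predicted years are hypothetical values, not output from any model in this report.

```python
import math


def regression_metrics(actual: list, predicted: list) -> dict:
    """Return MAE, RMSE, and R^2 for paired actual and predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    return {
        # Average miss, in years.
        "mae": sum(abs(e) for e in errors) / n,
        # Like MAE, but large misses are penalized more heavily.
        "rmse": math.sqrt(ss_res / n),
        # Share of the variation in year built that the predictions explain.
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical values for illustration only.
actual_years = [1950, 1975, 1990, 2005]
predicted_years = [1955, 1970, 1995, 2000]

scores = regression_metrics(actual_years, predicted_years)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are both expressed in years, which makes them easy to explain to a client; R² is unitless and shows how much better the model is than always guessing the average year.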